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Abstract 

This article proposes diffusion LMS strategies for distributed estimation over adaptive networks that 
are able to exploit sparsity in the underlying system model. The approach relies on convex regularization, 
common in compressive sensing, to enhance the detection of sparsity via a diffusive process over the 
network. The resulting algorithms endow networks with learning abilities and allow them to learn the 
sparse structure from the incoming data in real-time, and also to track variations in the sparsity of the 
model. We provide convergence and mean-square performance analysis of the proposed method and 
show under what conditions it outperforms the unregularized diffusion version. We also show how to 
adaptively select the regularization parameter. Simulation results illustrate the advantage of the proposed 
filters for sparse data recovery. 

Index Terms 

Diffusion LMS, adaptive networks, compressive sensing, distributed estimation, sparse vector. 

I. Introduction 

We consider the problem of distributed mean-square-error estimation, where a set of nodes is required 
to collectively estimate some vector parameter of interest from noisy measurements by relying solely 
on in-network processing. We consider an ad-hoc network consisting of N nodes that are distributed 
over some geographic region. At every time instant i, every node k collects a scalar measurement dt-(i) 
of some random process dk(i) and a 1 X M regression vector Uk : i of some random process Uk,i with 
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covariance matrix R Uy k = Ett^ > 0. The objective is for the nodes in the network to use the 
collected data {dk(i),Uk : i} to estimate some M X 1 parameter vector w" in a distributed manner. 

There are a couple of distributed strategies that have been developed in the literature for such purposes. 
One typical strategy is the incremental approach [HI-EL where each node communicates only with one 
neighbor at a time over a cyclic path. In the incremental strategy, information is processed in a cyclic 
manner across the nodes of the network until optimization is achieved. However, determining a cyclic 
path that covers all nodes is an NP-hard problem [6] and, in addition, cyclic trajectories are prone to 
link and node failures. When any of the edges along the path fails, the sharing of data through the 
cycle is interrupted and the algorithm stops performing. To address these difficulties, adaptive diffusion 
techniques were proposed and studied in |£7], O. In diffusion implementations, the nodes exchange 
information locally and cooperate with each other without the need for a central processor. In this way, 
information is processed on the fly by all nodes and the data diffuse across the network by means of a 
real-time sharing mechanism. The resulting adaptive networks exploit the time and spatial-diversity of 
the data more fully, thus endowing networks with powerful learning and tracking abilities. In view of 
their robustness and adaptation properties, diffusion networks have been applied to model a variety of 
self-organized behavior encountered in nature, such as birds flying in formation lPT2l . fish foraging for 
food [13] or bacteria motility [14]. Diffusion adaptation has also been used to solve dynamic resource 
allocation problems in cognitive radios [15], to perform robust system modeling |[T8l . and to implement 
distributed learning over mixture models in pattern recognition applications |[T6l . 

In many situations, the parameter of interest, w°, is sparse, containing only a few relatively large 
coefficients among many negligible ones. Any prior information about the sparsity of w° can be exploited 
to help improve the estimation performance, as demonstrated in many recent efforts in the area of 
compressive sensing (CS) [27]-[29[. Nevertheless, most CS efforts so far have concentrated on batch 
recovery methods, where the estimation of the desired vector is achieved from a collection of a fixed 
number of measurements. In this paper, we are instead interested in adaptive (online) techniques that 
allow the recovery of the vector w° to be pursued both recursively and distributively as new data arrive 
at the nodes. More importantly, we are interested in schemes that allow the recovery process to track 
changes in the sparsity pattern of the vector over time. Such schemes are useful in several contexts 
such as in the analysis of prostate cancer data |f28l , lEffl . spectrum sensing in cognitive radio [45], and 
spectrum estimation in wireless sensor networks ll46l . 

Some of the early works that mix adaptation with sparsity-aware constructions include methods that 
rely on the heuristic selection of active taps [20|-[22|, and on sequential partial updating techniques 1231 - 
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ll24lk some other methods assign proportional step-sizes to different taps according to their magnitudes, 
such as the proportionate normalized LMS (PNLMS) algorithm and its variations [25]-[26[. In subsequent 
studies, motivated by the LASSO technique [28] and by connections with compressive sensing [29], [30], 
several algorithms for sparse adaptive filtering have been proposed based on LMS [34]-[35], RLS 1136*1 . 
||3"7l , and projection-based methods II381 - IT391 . A couple of distributed algorithms implementing LASSO 
over ad-hoc networks have also been considered before, although their main purpose has been to use the 
network to solve a batch processing problem lf40l . BTI . One basic idea in all these previously developed 
sparsity-aware techniques is to introduce a convex penalty term into the cost function to favor sparsity. 

However, these earlier contributions did not consider the design of both adaptive and distributed 
solutions that are able to exploit and track sparsity while at the same time processing data in real- 
time and in a fully decentralized manner. Doing so would endow networks with learning abilities and 
would allow them to learn the sparse structure from the incoming data recursively and, therefore, to 
track variations in the sparsity pattern of the underlying vector as well. Investigations on adaptive and 
distributed solutions appear in [42 1,[43 1, and II441 . In If42"l . we employed diffusion techniques that are able 
to identify and track sparsity over networks in a distributed manner; the techniques relied on the use of 
convenient convex regularization terms. In the related work lf43l . the authors employ projection techniques 
onto hyperslabs and weighted t\ -balls to develop a useful sparsity-aware algorithm for distributed learning 
over diffusion networks. In P4 I. the authors use the same formulation of [42] and the techniques of 0- 
|f8l to independently arrive at useful diffusion strategies except that they limit the function /(•) in © to 
choices of the form for particular selections of p-vector norms; they also include the regularization 

factor into the combination step of their algorithm rather than in the adaptation step, as done further 
ahead in this work. The algorithms proposed here and in 11421 are more general in a couple of respects: 
they allow for broader choices of the regularization function /(•), they allow for sharing of both data 
and weight estimates among the nodes (and not only estimates) by allowing for the use of two sets of 
combinations weights {a;^, instead of only one set, and the resulting mean-square and stability 
analyses are more demanding due to these generalizations; see, e.g., Appendices A and B. We further 
use the results of the analysis to propose an adaptive method to adjust online the important regularization 
parameter 7 in Q. This is an important step in order to endow the resulting diffusion strategies with full 
adaptation abilities: adaptation to nonstationarities in the data and to changes in the sparsity patterns of 
the data. 

The approach we follow in this work is based on developing diffusion strategies that are stable in the 
mean-square-error sense, with guaranteed performance bounds. For this reason, a detailed mean-square- 
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error analysis is carried out in order to examine the behavior of the algorithm in the presence of noisy 
measurements and random regression data. The analysis ends up suggesting a convenient method for 
adapting the regularization parameter in a distributed manner as well. Doing so, enhances the sparsity- 
awareness of the algorithm and adds another useful layer of adaptation to the operation of the network. 
In summary, in this paper we extend our preliminary work in Il42l to develop adaptive networks running 
diffusion algorithms subject to constraints that enforce sparsity. We consider two convex regularization 
constraints. In one case, we consider an £i-norm regularization term, which acts as a uniform zero- 
attractor. In another case, and in order to improve the estimation performance, we employ reweighted 
regularization to selectively promote sparsity on the zero elements of w°, rather than uniformly on all 
the elements. We provide detailed convergence analysis of the proposed methods, giving a closed form 
expression for the bias on the estimate due to regularization. We carry out a mean-square-error analysis, 
showing the conditions under which the sparse diffusion filter outperforms its unregularized version in 
terms of steady-state performance. It turns out that, if the system model is sufficiently sparse, it is possible 
to tune a single parameter to outperform the standard diffusion algorithm. Then, on the basis of this result, 
we propose a method to adaptively choose the regularization parameter. In this way, the network is able 
to learn the sparse structure from the incoming data recursively and to adjust its parameters for improved 
tracking. The main contributions of this paper are therefore: (a) exploitation of sparsity for distributed 
estimation over adaptive networks; (b) derivation of the mean-square properties of the sparse diffusion 
adaptive filter; (c) and adaptation of the regularization parameter to enhance performance under sparsity. 

The paper is organized as follows. In Section II we develop the sparse diffusion algorithm for distributed 
adaptive estimation. Section III provides performance analysis, which includes mean stability, mean- 
square performance and adaptation of the regularization parameter. Section IV provides simulation results 
in support of the theoretical analysis, and section V draws some conclusions. 

Notation: we use bold face letters to denote random variables and normal font letters to denote their 
realizations. Matrices and vectors are respectively denoted by capital and small letters. 

II. Sparse Distributed Estimation Over Adaptive Networks 

We assume the data {d k (i), u k -i\ collected by the various nodes are related to an unknown sparse 
vector w° via the linear model: 

d k (i) = u kii w° + v k (i) (1) 

where v k (i) is a zero mean random variable with variance <7^ k , independent of u\ j for all I and j, and 
independent of vi(j) for I ^ k and i / j. Linear models of the form (Q3 arise frequently in applications 
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and are able to represent many cases of interest. The cooperative sparse estimation problem is cast as the 
search for the optimal estimator that minimizes in a fully distributed manner the following cost function: 

N 

J gloh H = E KW - U *M 2 + 7/M (2) 

k=l 

where E(-) denotes the expectation operator, and f(w) is a real-valued convex regularization function 
weighted by the parameter 7 > 0, which is used to enforce sparsity of the solution. The optimization 
problem in © may be solved in a centralized fashion. In this approach, the nodes send their data to a 
central processor, or fusion center, where all computations can be performed. Centralized implementations 
of this type require transmitting data back and forth between the nodes and the central processor, which 
translates into requirements of power and bandwidth resources. Additionally, centralized solutions have 
a serious point of failure: if the central processor fails, then the network operation is adversely affected 
and operation comes to a halt. For these reasons, we are interested in distributed solutions, where each 
node communicates with its neighboring nodes, and processing is distributed among all nodes in the 
network. In this way, communications are localized, and even when individual nodes fail, the network 
can continue to operate. 

A. Adaptive Diffusion Strategy 

We follow the approach proposed in JS], iTTTI to derive distributed strategies for the minimization of 
J glob (w) in ©. We start by introducing an N x N matrix C with non-negative entries {q ^} such that 

c l)k >0 if leN k , Cl = l, (3) 

where 1 denotes the N x 1 vector with unit entries and A/fc denotes the neighborhood of node k. Each 
coefficient represents a weight value that node k assigns to information arriving from its neighbor I. 
Of course, the coefficient q & is equal to zero when nodes I and k are not connected. Furthermore, each 
row of C adds up to one so that the sum of all weights leaving each node / should be one. Using the 
coefficients in (O, the global cost function in (O can be expressed as: 

J^ h (w) = Jl°» + £ + 7/H (4) 

l^k 

where 

N 

Ji° c (w) = %k®\M*) ~ u hM 2 (5) 
1=1 

The function introduced in © is a local (neighborhood) cost for node k; it involves a weighted combi- 
nation of the costs of its neighbors without considering the sparsity constraint. Assuming the processes 
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dk(i) and t*& $ are jointly wide sense stationary, the minimization of the local cost function © over w 
leads to the optimal local solution: 



,„loc 

w k 



f N \ 1 / N \ 

^ Q,kR u ,i ^ ci t krdu,l (6) 
U=i / \j=i / 



where R u ^ = Eu^w^j is assumed positive-definite (i.e., R u ^ > 0) and r^fc = Edfc(i)w£ •, where 
the operator * denotes complex-conjugate transposition. Thus, the local estimate w\ oc is based solely on 
local covariance data {R u ,i, rd u ,i}ieM k from the neighborhood of node k. If we multiply both sides of 
(OQ) by u* k i and take expectations and then add over the neighborhood of node k, it is easy to verify that 
the estimate w l k oc from © agrees with the desired vector w°. Therefore, in principle, each node k can 
estimate w° if it knows the moments {R u ,i,rd u ,i}- Often, in practice, these moments are not available and 
nodes only sense realizations of data arising from these statistical distributions. In that case, cooperation 
among the nodes can help them improve their estimates of w° from the data realizations. To motivate 
the cooperative procedure, we start by noting that a completion of squares argument shows that d5j can 
be rewritten in terms of wj? c as 

f k oc (w) = \\w - w l k oc \\l k + mmse (7) 

where mmse is a constant term that does not depend on w, the notation ||a||| = a*£a, for any nonnegative 
definite matrix E, and 

N 

Tfc — "^2, C l,kRu,l- (8) 

1=1 

Thus, using (01), ® and (O, and dropping the constant mmse terms, we can replace the original global 
cost (fj) with the equivalent cost: 

j gloh '(w) = %kmi(i) - uiM 2 + E - ^iir, + -rfw ( 9 ) 

Expression © shows how the local cost J k oc (w) can be modified to approach the desired global cost; 
two correction terms appear on the right-hand side: the regularization term "yf{w) and a sum involving 
the local estimates {w\ oc }. Node k cannot minimize (O directly. This is because the cost in (© still 
requires the nodes to have access to global information, namely, the local estimates {w\ oc }, and the 
matrices {T/}, from all other nodes in the network. To enable a distributed and iterative procedure, we 
make three adjustments to d9j. 

First, we limit the sum over I ^ k to a sum over the neighbors of node k, i.e., only over I G A/fc. 
This step is reasonable since node k can only rely on data that are available to it from its neighborhood. 
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Second, we replace the covariance matrices Ti with constant diagonal weighting matrices of the form 
= bikini, where b^k is a set of non-negative real coefficients that give different weights to different 
neighbors, and lu is the M x M identity matrix. Although the {1^} from its neighbors are available to 
node k, this step is meant to simplify the structure of the resulting algorithm. This substitution is also 
reasonable in view of the fact that norms are equivalent and that each of the weighted norms in (9) can 
be bounded as 

A mi „(ro • || W - w x r\\ 2 < \\ w - w i r\\i t < A max (r,) • || W - w i n\ 2 m 

Substitutions of this kind are common in the stochastic approximation literature where Hessian matrices, 
such as Ti, are replaced by multiples of the identity matrix; such approximations allow the use of simpler 
steepest-descent iterations in place of Newton-type iterations [11]. At this stage, we do not need to worry 
about the selection of the weights {bi t k} because they are going to be embedded into another set of 
coefficients that the designer can choose. Finally, while the nodes are attempting to estimate w°, they do 
not know what the optimal local estimates u;J oc are during the iterative learning process. As the ensuing 
discussion will reveal, each node in the resulting distributed algorithm will be working on estimating the 
sparse vector w° and will have access to a local estimate for w°, which we denote by tpi at node /. Due 
to the diffusion process, this estimate will not be based solely on data from the neighborhood of node 
I but also on data from across the network. We are therefore motivated to replace w] oc in (© by ipi. In 
this way, each node k can instead minimize the following modified local cost function: 

■4 dist H = J2 candid) - u l>iW \ 2 + £ b^ww-^f + 7 f( w ). (ii) 

leMk ieAr k /{k} 
The cost in (ITTb is now defined in terms of information that is available to node k. Observe that while 
(TTTb is a local approximation for the global cost ©, it is nevertheless more general than the local cost 
©. The node k can then proceed to optimize ([Til by means of a steepest-descent procedure. Note that 
all functions in (TTTb are continuously differentiable except possibly f(w), which is only supposed to be 
convex. Thus, computing the sub-gradient of (TTTb we obtain 

[V J, dist H]* = ^ c^Rujw - r du ,i) + Y, bl ^ w ~ ^ + (12) 
leAfk ieAfk/{k} 

where df(w) is the sub-gradient of the convex function f(w). Then, we can use (TTH ) to obtain a steepest 
descent recursion for the estimate of w° at node k at time i, denoted by Wk,i, such as 

Wk,i = w k) i-i + fJ-k c i,k(rdu,i ~ Ru,iWk,i-i) + k t k(ipi - Wk,i-i) - fJ-kldf(w k ,i-i) (13) 

leAfk ieM k /{k} 
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for some sufficiently small positive step-sizes {/x^}. The update (TT3T ) involves the sum of three terms and 
we can compute it in two steps by generating an intermediate estimate ipk,i* as follows: 

lpk,i = w k,i-l + l^k 53 c l,k( r du,l ~ Ru,l w k,i-l) ~ ^kl^fi^kfi-l) (14) 

w k ,i = ^k,i + Vk 53 b iA4>l - wjm-i) ( 15 ) 

l£Af k /{k} 

Since every node in the network will be running recursions of the form (fl4l) -([T5Tl. then the intermediate 
estimate of w° that is available to each node I at time i is tfii^. Therefore, we replace tpi in (031 ) by ipi^. 
Moreover, since ipk,i is an updated estimate relative to Wk i-x, as evidenced by (fT4b . we are motivated to 
replace w^i-i i n (TT3T > by ipki, which generally leads to enhanced performance since ip^i contains more 
information than w^ i-i. This step is reminiscent of an incremental-type substitution CD-El. Performing 
these substitutions in ([131) . we get: 

Wk,i = ipk,i + V>k 53 h,k{i>l,i ~ i>k,i) (16) 
letf k /{k} 

If we now introduce the entries of an N x N matrix A = {aik} such that 



A 

O-Lk 



fJ-kh,k (l^k), a k:k = 1 - fjL k 53 h,k, a>l,k = if / £ Afk (I 7 ) 

ieAf k 

then, we can rewrite the update in (PT4Ti-([T5Tl as: 



tpk,i = Wk,i-i + Mfc 53 c i,k( r du,i - R u ,iWk,i-i) - Lik7df(wk,i-i) 
ieM k 



Wk,i = 53 ai ^ l r 



(18) 



where the weighting coefficients {atjfc,Q fc} are real, non-negative and satisfy: 

Q,fc>0, a^>0 if l£Nk, Ct = t, A T t = t. (19) 

The recursion in <HTS\ requires knowledge of the second-order moments {R Ui k, r du,k}- An adaptive 
implementation can be obtained by replacing these second-order moments by local instantaneous ap- 
proximations, say, of the LMS type, as follows: 

R u,k - u* kji u k ,i, r dUtk ~ d k (i)u* k i . (20) 
Thus, substituting the approximations (1201 into (1T8T ), we arrive at the following Adapt-then-Combine 



(ATC) strategy. We refer to the algorithm as the ATC sparse diffusion algorithm. The first step in (J2TJ 
is an adaptation step, where the coefficients q k determine which nodes I £ N k should share their 
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ATC sparse diffusion LMS 



Start with Wk-i = for all k. Given non-negative real coefficients {a/,fc, satisfying ( fT9l ), for each 
time i > and for each node k, repeat: 

VV = + fJ-k ^2 °i,kuli[di(i) - ui t iW k) i-i] - Hkldfiwk^x) (adaptation step) 

leM k (21) 

Wk,i = ^2 %ki>l,i (diffusion step) 

«e Ad- 



measurements {di(i),ui : i} with node k. The second step is a diffusion step where the intermediate 
estimates tpi^, from the neighbors I £ are combined through the coefficients {a/,fe}. We remark that 
had we reversed the steps (fT4T > and (fT3T l to implement (TT3T >. we arrive at a similar but alternative strategy, 
known as the Combine-then- Adapt (CTA) strategy; in this implementation, the only difference is that 
data aggregation is performed before adaptation (see, e.g., flU). The complexity of the sparse diffusion 

CTA sparse diffusion LMS ~ 

Start with Wk-i = for all k. Given non-negative real coefficients {a^k, Q,fc} satisfying (TT9l) . for each 
time i > and for each node k, repeat: 

ipk,i-i = ^2 a l,kWi,i-i (diffusion step) 

«eA4 ( 2 2) 

w k ,i = ipk,i-l +Vk^2 c l,k u *,A d l( i ) - ui,iipk,i-l] - Vkpdf(ipk,i-i) (adaptation step) 



schemes in (l2TT>-(f22T> is 0(3M), which is the same complexity as standard stand-alone LMS adaptation. 
It was argued in [ 8 1 that ATC strategies generally outperform CTA strategies. For this reason, we continue 
our discussion by focusing on the ATC algorithm (l2TT l: similar analysis applies to CTA. 

Compared with the strategies proposed in B31 and P4l . the diffusion algorithm d2TT > exploits data 
in the neighborhood more fully; the adaptation step aggregates data {d[(i),ui t i} from the neighbors, 
and the diffusion step aggregates estimates {ipi,i} from the same neighbors. The implementation in P31 
uses a different algorithmic structure with C = I so that data {di(i), u^i} from the neighbors are not 
directly used. Compared with [44], observe that the effect of the regularization factor in (|2H influences 
the adaptation step, and not the combination step as in PRl . Observe also that the adaptation step allows 
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for the exchange of data {di(i),ui i} among the nodes through the use of the coefficients {q^}, whereas 
Il44ll uses C = I as well. 



B. Sparse Regularization 

Before proceeding with the discussions, let us comment on the regularization function f(w) in (0. 
A sparse vector w° generally contains only a few relatively large coefficients interspersed among many 
negligible ones and the location of the non-zero elements is often unknown beforehand. However, in 
some applications, we may have available some upper bound on the number of nonzero elements. Thus, 
assume that w° satisfies 

Kilo < r, (23) 

where || • ||q is the iV norm > denoting the number of non-zero entries of a vector, and r is a known upper 
bound. Since the ^cr norm in (1231 is not convex, we cannot use it directly. Thus, motivated by LASSO 
ll28l and work on compressive sensing [29], we first consider the following ^i-norm convex choice for 
a regularization function: 

M 

fi(w) = \\w\\\ = \w m \ (24) 

m=l 

which amounts to the sum of the absolute entries of the vectors. The ^i-norm works as a surrogate 
approximation for the 4r norm - This choice leads to an algorithm update in (1211 where the subgradient 
column vector is given by 

dfi(w) = sign(io) (25) 



and the entries of the vector sign(io) are obtained by applying the following function to each entry of w: 

sign(io m ) 



( 

W m /\w m \, W m ^0 

(26) 



0, w m = 

This update leads to what we shall refer to as the zero-attracting (ZA) diffusion algorithm. The ZA 
update uniformly shrinks all components of the vector, and does not distinguish between zero and non- 
zero elements ||3"0l , ||34l . Since all the elements are forced toward zero uniformly, the performance would 
deteriorate for systems that are not sufficiently sparse. Motivated by the idea of reweighting in compressive 

sampling |30l , j34l , |[38l , we also consider the following approximation: 

M | | 

Hlo - £ (27) 
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which, for very small positive values of e, is a better approximation for the l^-norm of a vector w 
than the £i-norm [30], thus enhancing sparse recovery by the algorithm. Therefore, interpreting (1271 
as a weighted ^i-norm regularization, to update the algorithm in d2TV we shall consider the following 
sub-gradient column vector: 

1 1 1 



dfi{w) = diag 



e + \wi\ ' e + \w 2 \ ' " ' ' e + \w M \ 



sign(w ) 



(28) 



This choice leads to what we shall refer to as the reweighted zero-attracting (RZA) diffusion algorithm. 
The update in (f28T > selectively shrinks only the components whose magnitudes are comparable to e, and 
there is little effect on components satisfying \w m \ > e, see, e.g., EH, OH, EH, EH, ll44l . 



III. Mean-Square Performance Analysis 

From now on, we view the estimates Wk,i as realizations of a random process wj.^ and analyze the 
performance of the sparse diffusion algorithm in terms of its mean-square behavior. To do so, we introduce 
the error quantities Wk,i = w° — Wk,i, il>k,i = w° — and the network vectors: 



W; 



Wl,i 




Wl,i 








, Wi = 








W N ,i 




W N ,i 







(29) 



We also introduce the block diagonal matrix 

M = diag{/xi/ A/ , . . . , ^ N hi} 
and the extended block weighting matrices 

C = C®Im, A = A®I m 



(30) 



(31) 



where ® denotes the Kronecker product operation. We further introduce the random block quantities: 

N N 

(32) 



Di = diag^ ^ci >1 ul i u l;i ,...,'^2ci >N ul i u l;i I 
^ l=i i=i > 



9i = C^c6L{u* ti vi(i) i ...,u^ fti v N (i)} 
Then, we conclude from (|2~T1) that the following relations hold for the error vectors: 

^ = Wi-i - M [DiWi-i + g { ] + jMdf(wi-i) 



W, 



(33) 

(34) 
(35) 
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where 

df(wi-{) = col{df(wx,i-x), ■ ■ df(w Nii -x)} (36) 
We can combine (l34l) and (l35l) into a single recursion: 

(37) 



tbi = A T [I - MDi]w t _x - A T M gi + iA r Mdf{wi-T) 



This relation tells us how the network weight-error vector evolves over time. The relation will be the 
launching point for our mean-square analysis. To proceed, we introduce the following independence 
assumption on the regression data. 

Assumption 1 (Independent regressors) The regressors u^i are temporally white and spatially inde- 
pendent with R u k = Eitfc j«£ ^ > 0. ■ 
It follows from Assumption 1 that Uk,% is independent of {wij} for all I and j < i — 1. Although not 
true in general, this assumption is common in the adaptive filtering literature since it helps simplify the 
analysis. Several studies in the literature, especially on stochastic approximation theory [47 1-[48 |, indicate 
that the performance expressions obtained using this assumption match well the actual performance of 
stand-alone filters for sufficiently small step-sizes. Therefore, we shall also rely on the following condition. 
Assumption 2 (Small step-sizes) The step-sizes {pk} are sufficiently small so that terms that depend on 
higher-order powers of p^ can be ignored. ■ 

A. Convergence in the Mean 
Let 

, N N 

V = EDi = diagi J^q,!^, . . . , c l,^Ru,l \ <38) 
^ i=i i=i > 

Then, taking expectations of both sides of (f3Tb and calling upon Assumption 1, we conclude that the 

mean-error vector evolves according to the following dynamics: 

' (39) 



Eibi = A 1 [I - MV]Ewi-x + -fA 1 MEdfiWi-x 



The following theorem guarantees the asymptotic mean stability of the diffusion strategy (1211 . and 
provides a closed form expression for the weight bias due to the use of the regularization term. 
Theorem 1 (Stability in the mean) Assume data model ([7]) and Assumption 1 hold. Then, for any 
initial condition and any choice of the matrices A and C satisfying ( U9D , the diffusion strategy < \21i 
asymptotically converges in the mean if the step-sizes are chosen to satisfy: 

2 

< /ife < 7 r k = l,...,N (40) 
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where X max (X) denotes the maximum eigenvalue of a Hermitian positive semi-definite matrix X. Fur- 
thermore, as i — > oo, the estimators across all nodes have biases that are given by the corresponding 
entries in the following bias vector: 



bias = lim Mibi = 7 • B ■ lim Edf(wi-i) (41) 

i— >oo i—too 

where 

B=[I-A T {I- MV)] _1 A T M. (42) 

Moreover, it holds that 

\\bias 6 oo < j (43) 

where II • ILoo w the block maximum norm of a vector (defined in Appendix A), u max = max u fc , 

k=l,...,N 

9/max = max, ||3/(ii;j_i)||;, i00 and 5 = p(I — MD) < 1, with p(X) denoting the spectral radius of a 
matrix X. 

Proof: See Appendix A. ■ 

B. Convergence in Mean-Square 

We now examine the behavior of the steady-state mean-square deviation, E||xo& ^\\ 2 as i — > 00, for any 
of the nodes and derive conditions under which the sparse diffusion filter outperforms its unregularized 
version in terms of steady-state performance. In particular, we will show that, if the vector parameter w° 
is sufficiently sparse, then it is possible to tune the sparsity parameter 7 to achieve better performance 
than the standard diffusion algorithm. Following the energy conservation framework of d, ||8] and under 
Assumption 1, we can establish the following variance relation: 

E||t&i||| = E||«7i_i|||, +E[g*MAY l A T Mg i ] + 2'jEdf{w i ^ 1 ) T MAT l A T (I - MV)w^ x 

+ 7 2 E||a/(^_ 1 )||^ 2 _ 4rA/( (44) 

where S is any Hermitian nonnegative-definite matrix that we are free to choose, and 

£' = E{I - DiM)A^A T {I - MDi) (45) 

Relations (|44"1)-(|451) can be derived directly from (l37l i if we compute the weighted norm of both sides 
of the equality and use the fact that g i is independent of Wj-i and tWj_i. We can rewrite (l44l more 
compactly if we collect the terms depending on 7 as 

Ell^Hl =E||«7 i _i|||, +E[g*MAT.A T Mg i ] +^(7) (46) 
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with 



<hM = 7 ^( 7 -^f) (47) 

/3 E)i = EUS/CtOi-OII^ME^^O (48) 

a Eji = -2E<9/(™;_i) T .A/M£„4 T (/ - .MP) u>i_i (49) 

Moreover, setting 

G = K[g t g*] = C T ■ diag{o* Ai, • • • , ^jv^.iv} • C (50) 
we can rewrite (l46T l in the form 

EH&illl = E||i6i_i|||j, + Tr[£„4 T .MaM„4] + S)i ( 7 ) (51) 

where Tr(-) denotes the trace operator. Let a = vec(S) and a' = vec(S'), where the vec(-) notation stacks 
the columns of S on top of each other and vec~ 1 (-) is the inverse operation. We will use interchangeably 
the notation \\w\\ 2 and \\w\\j^ to denote the same quantity w*T,w. Using the Kronecker product property 
vec(UT>V) = (V T <g>U)vecCE), we can vectorize both sides of (l45l and conclude that (|45T ) can be replaced 
by the simpler linear vector relation: a' = vec(S') = To, where F is the following N 2 M 2 x N 2 M 2 
matrix with block entries of size M 2 x M 2 each: 

T = (J ® /){/ — J (DM) - (P T 7W) ® I + E(£>f M) 8) (DjM)}(,4 .4) (52) 

Using the property Tr(EX) = vec(X T ) T o we can then rewrite (1511 as follows: 

E||t&i||£ = EH^ill^ + [vec(A T Mg T MA)] T a + ^(7) (53) 

The following theorem guarantees the asymptotic mean-square stability (i.e., convergence in the mean 
and mean-square sense) of the diffusion strategy (|2T| |. 

Theorem 2 (Mean-Square Stability) Assume the data model ([7]) and Assumption 1 hold. Then, the 
sparse diffusion LMS algorithm rt2iD will be mean-square stable if the step-sizes are sufficiently small 
and satisfy ( 1401 ), and the matrix J- in ( 1521 ) is stable. 

Proof: See Appendix B. ■ 
Remark 1: Note that the step sizes influence d52l through the matrix M.. Since the step-sizes are 
sufficiently small, we can ignore terms that depend on higher-order powers of the step-sizes. Then, we 
can approximate (l52l as 



T^(KgI){l-Kg (VM) - (V T M) <g> I + V T M (g VM} (Ag)A)=B T (g) B* (54) 
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where B = A T {I- MV). Now, since A is left-stochastic, it can be verified that the above T is stable if 
(I — VM) is stable ifTTI . Q; this latter condition is guaranteed by (l40l) . In summary, sufficiently small 
step-sizes ensure the stability of the diffusion strategy in the mean and mean-square senses. ■ 
Taking the limit as i — > oo (assuming the step-sizes are small enough to ensure convergence to a 
steady-state), we deduce from (l53l l that: 

lim E\\w£j_~ a = [vec(A T MG T MA)] T a + 7 /3s,oo f 7 " ^) (55) 

where 



a S,oo 
/?S,oo 



lim -2Ed / '(«>i_i) T MAE A T (I - MV) 



.limElia/K-i)!!^^ 



M 



(56) 
(57) 



The limits in d56l)-(l57[) exist. Indeed, first, in Appendix C we show that a^i converges to as i00 . Second, 
we also show in Appendix B that the LHS of ( f55l) converges. Therefore, the term /3s,oo also exists. 

Expression (|55l l is a useful result: it allows us to derive several performance metrics through the proper 
selection of the free weighting parameter a (or £), as was done in (SJ. For example, the MSD for any 
node k is defined as the steady-state value E||«?fc i|| 2 , as i — > oo, and can be obtained by computing 
linij^oo E||uJj||y with a block weighting matrix that has the M x M identity matrix at block (k, k) and 
zeros elsewhere. Then, denoting the vectorized version of the matrix by = vec(diag(efc) ® Im), 
where e& is the vector whose A;-th entry is one and zeros elsewhere, and if we select a in (I55T ) as 
cjfc = (I — F)~ l tk, we arrive at the MSD for node k: 



MSD A 



[veciA 1 MQ MA)] 1 (I - F)~% + 7 /3s fc 



7 



fc,00 



Sfc,00 



The average network MSD nct is given by: 

MSD nct : 



1 N 

lim — VEI 

«-s>oo N ^ 
k=l 



Wk.i 



(58) 



(59) 



Then, to obtain the network MSD from (1551 ). the weighting matrix of lim^oo EHtWiH^ should be chosen 
as T = Imn/N. Let q denote the vectorized version of Imn, i-e., q = vec(/Miv)> anc ^ selecting a in 
d55l ) as a = (I - T)~ x q/N, the network MSD is given by: 



MSD 



nit = ^[vec(A T Mg T MA)] T (I-Fr 1 q + ^(3^ 



7 



as,: 



(60) 
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C. Comparison with Unregularized ATC Diffusion 

We now examine under what conditions the sparse diffusion filter (12Tb dominates in terms of mean- 
square performance its unregularized counterpart when 7 = 0. Considering the MSD expression (I58T ) 
at the k-th node, we notice that the first term on the RHS coincides with the MSD of the standard 
diffusion algorithm when 7 = (compare with (48) in [8|), whereas the second term in (l58l is due to 
the regularization. Then, if 

«E fc ,oo > and < 7 < (61) 

the second term on the RHS of d58l ) is negative and sparse diffusion would outperform standard diffusion. 
The condition as fc ,oo > 0, where o:£,i is given by d49b , is a necessary condition to have dominance of 
sparse diffusion over standard diffusion. Let us examine an interpretation for the condition as fc ,oo > 
in terms of the sparsity of the vector w°. Since /(•) is a real-valued convex function, by the definition 
of subgradient it holds that 

f(x + y)-f(x) >df(x) T y => -df(x) T y>f(x)-f(x + y) (62) 

Then, choosing x = and y = Bj^ k (1 ® w° — t«»_i), where B^ k = 2MAT<kA T (I — MV), we get 

as fc ,oo = -2 lim Edf(wi„i) T MAX k A T (I - MV) t&<_i 

i— >oo 

> lim E[/K_i) - /K_a + B Sk (l®w -w i ^ 1 ))} (63) 

1— >oo 

If the step-sizes are sufficiently small, we can approximate Bz k ±2 2AiAY,kA T , neglecting the second 
term that depends on {^}. Then, we have 

ibi = Wi-i + -B Sfc (l®w°- Wi-i) ^ Wi-i - 2MAY> k A T {wi„i - \®w°) (64) 

At convergence, the vector Wi-% fluctuates close to 1 ® w°. Now, since > 0, expression (|64l can be 
interpreted as a gradient descent update minimizing the function \\w — 1 ® u, °ll3lSi ; ^ T ' yielding for small 
step-sizes a vector ibi that is closer to 1 ®w° than If l<g>w° is sparse, the non-zero elements (NZ 

set) of the vector are in general much less in number than the zero elements (Z set). Then, the gradient 
update in (|64l helps move the components of the vector Wi that belong to the Z set closer to zero. 
Intuitively, if the Z set is larger than the NZ set, ibi will be more sparse than Wi-\. Thus, considering 
(l63l at convergence, since the function f(w) measures the sparsity of the vector w, it is expected that 

lim E[/(«7i_i) - f(wi)] > (65) 

i— >oo 
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since tUj is likely to be more sparse than Consequently, the condition as fc ,oo > is likely to 

be true. Therefore, by properly selecting the sparsity coefficient 7 to satisfy (loTT) . the sparse diffusion 
algorithm will yield better MSD than the standard diffusion algorithm at each node. On the other hand, 
if w° is not sparse, condition (l63l ) in general would not be true and the sparse diffusion algorithm will 
perform worse than standard diffusion. 



D. Adaptation of the Regularization Parameter 

To endow networks with the capability to adaptively exploit and track the sparsity of the system model, 
we now propose a systematic approach to choosing the regularization parameter 7 in an adaptive fashion. 
We thus allow the sparsity parameter to be iteration dependent, i.e., 7 = 73. Following similar steps as 
in Section III.B, we can replace ( f5Tb with the conditional relation: 

e [[it&iiilK-i] = +tt\y,a t mg t ma] +<h,i(rfi) ( 66 ) 

where £' is given by (|45T ) and 

<fe,i(7i) = 7i/3s,i (li - ( 67) 
ft* = 119/(^-1)11^4^ >0 (68) 
otE,i = —2df(wi-i) T MAEA T [I — MV] Wi-\ (69) 

Thus, letting £ = / and if <fe.i(7i) < 0, the sparse diffusion algorithm will outperform the standard 
diffusion algorithm in terms of the instantaneous MSD. The condition 0E,j(7i) < is satisfied when 

az ti > and < 7i < ^2* (70) 

Since 0s,i(7i) i n < l67l ) is quadratic in 73, we can choose the optimal parameter that minimizes (l67T l as: 

Now, exploiting the small step-sizes assumption in ( f69b . we consider the following approximation: 

os,i ^ -29/(w ! - 1 ) r MA4 T w i _i (72) 

An approximate expression for the sparsity parameter in (1711 ) is then given by: 

-df{w^ 1 ) T MAA T w i ^ 1 



7 - max i ' P7(^m^r/ (73) 

Remark 2: The rule (|73T ) cannot be directly used due to the presence of the true parameter vector w° 
in Wi-i, which is unknown to the nodes in the network. Furthermore, the update (l73l) depends on data 
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coming from all nodes. However, in the sequel we propose some useful approximations that allow the 
local computation of the regularization parameter. ■ 
First, we notice that the regularization parameter (f73l depends on the combination matrix A, which 
influences how the nodes perform the combination step in (|2TT i. This step helps improve the quality of the 
node's estimate Wk t i by reducing the effect of the measurement and gradient noises but, it generally has 
a marginal effect on the sparse recovery capability of the algorithm. The regularization function appears 
instead inside the adaptation step in (|2TI ). Thus, to simplify expression (|73T ). we consider the case in 
which we want to select 7, under the condition that A = I, i.e., no cooperation is performed among the 
nodes. In this case, the following relations hold: 

N 

ft* * 5>il|£>/Ki-i)|| 2 (74) 

k=l 

N 

as,i ^ -2^^ fe d/(u> fci i_i) T 7Z> fci j_i (75) 

k=l 

Let x = Wki-i and y = w° — Wki-i- Using d62b , we find that 

N N 

«e,i ~ -2 ^ n k df(w kti -.i) T w k! i-i > 2 ^ ft[fKi-l) - f( w °)} (76) 

k=l k=l 

In practice, some prior knowledge about the sparsity of the true vector w° is often available. For example, 
the ^i-norm of w° can be upper bounded by some constant value ||28l . In this work, we assume that 

f{w°) < V, (7V) 

for some given positive constant 77. Using (1771 ) in d76l ), we get 

N 

as,i > 2 ^2 Vk[f{wk,i-i) ~ rj\ (78) 
fc=i 

and, using d74b and d78l) . the regularization parameter in (ITTb can instead be approximated as: 



E£=iMfc[/(w*,i-i) -v] 



Remark 3: The update (1791 ) still depends on data coming from all nodes in the network. However, we 
can replace (1791 with a local rule where each node computes its own p° k i from data received from its 
neighbors only, say, 

<, = max(o,p^4g^i!^} ,80, 
In the simulation section, we will check the performance of the sparse diffusion strategy using (l80l l. ■ 
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We summarize below the sparse diffusion strategy with adaptive regularization. The complexity of this 
strategy is 0(4M), which is the same complexity as standard stand-alone LMS adaptation. 

ATC sparse diffusion LMS with adaptive regularization 

Start with Wk,-i = for all k. Given non-negative real coefficients {ai^c^} satisfying (fT9l) . for each 
time i > and for each node k, repeat: 

7*,i = max \ °' 2TT1J77 TTTo" r (sparsity control) 

^fc,i = Wk,i-\ + Mfc ^ QW,;K(i) - ««,i«>fc,i-l] - Vk7k,idf( w k,i-i) (adaptation step) (81) 

leAf k 

w k ,i = ^2 a i,ki>i,i (diffusion step) 



Remark 4: Equation (l80l indicates that, in order to ensure superiority of the sparse diffusion strategy, the 
construction (l80l is triggered only if X]zeA4 Pllf { w l,i-i) — ^1 > 0, otherwise, 7° = 0. The performance 
of the sparse diffusion strategy depends on how close the upper bound 77 is to the right value. In the 
simulation section, we will check the robustness of the regularized diffusion algorithm to misspecified 
values of 77. ■ 

IV. Simulation Results 

In this section, we provide some numerical examples to illustrate the performance of the sparse diffusion 
algorithm. In the first example, we compare the performance of the sparse diffusion strategy with respect 
to standard diffusion, considering fixed values of the regularization parameter 7. The second example 
shows the benefits of adapting the sparsity parameter according to (l80l ). 

Numerical Example 1 : Performance : We consider a connected network composed of 20 nodes. The 
topology of the network is shown in Fig. Q] The regressors u^j have size M = 50 and are zero-mean 
white Gaussian distributed with covariance matrices R u ^ = o - ^ k Iu, with k shown on the top right side 
of Fig. Q] The background white noise power a\ k of each node is depicted on the bottom right side of 
Fig. CD The first example aims to show the tracking and steady-state performance for the sparse diffusion 
algorithm. In Fig.|2j we report the learning curves in terms of network MSD for 6 different adaptive filters: 
ATC diffusion LMS GO, ZA-ATC diffusion described by <HD and |25]> and RZA-ATC diffusion described 
by (1211 ) and ( |28T ), and the non-cooperative approach from [34 1 . The simulations use a value of fi = 0.1 and 
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2 4 6 8 10 12 14 16 18 20 

Node number k 



Fig. 1: Network topology (left), noise variances (right, bottom) and regressor variances (right, top). 

the results are averaged over 100 independent experiments. The sparsity parameters are set to 7 = 5 x 10~ 3 
for ZA-LMS, 7 = 0.7 x 10~ 3 for RZA-LMS, j Z A = 10~ 3 for ZA-ATC, j RZ a = 0.25 x 10~ 3 for RZA- 
ATC, and e = 0.1. In this simulation, we consider diffusion algorithms without measurement exchange, 
i.e., C = I, and a combination matrix A that simply averages the estimates from the neighborhood such 
that ai : k = l/\N"k\ for all I. Initially, only one of the 50 elements of w° is set equal to one while the 
others are equal to zero, making the system very sparse. After 1000 iterations, 25 elements are randomly 
selected and set equal to 1, making the system have a sparsity ratio of 25/50. After 2000 iterations, 




500 1000 1500 2000 2500 3000 

Iteration index 



Fig. 2: Transient network MSD for the non-cooperative approaches LMS, ZA-LMS 1 34], RZA-LMS [34], 
and the diffusion techniques ATC H, ZA-ATC given by (EI])-(|25]>, RZA-ATC given by d2]])-([28]). 
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Fig. 3: Differential MSD versus sparsity parameter 7 for ZA-ATC Diffusion LMS (left) and for RZA-ATC 
Diffusion LMS (right), for different degrees of system sparsity. 

all the elements are set equal to 1, leaving a completely non-sparse system. As we see from Fig. [2] 
when the system is very sparse both ZA-ATC and RZA-ATC yield better steady-state performance than 
standard diffusion. The RZA-ATC outperforms ZA-ATC thanks to reweighted regularization. When the 
vector w° is only half sparse, the performance of ZA-ATC deteriorates, performing worse than standard 
diffusion, while RZA-ATC has the best performance among the three diffusion filters. When the system 
is completely non-sparse, the RZA-ATC still performs comparably to the standard diffusion filter. We 
also notice the gain of diffusion schemes with respect to the non-cooperative approaches from ll34l . 

The theoretical derivations in Section III showed that it is possible to select the regularization parameter 
7 in order to have dominance in terms of MSD of the ATC-SD filter with respect to the unregularized 
diffusion algorithm. To quantify the effect of the sparsity parameter 7 on the performance of the ATC-SD 
filters with respect to different degrees of system sparsity, we consider two additional examples. In Fig. 
|3] (left), we show the behavior of the difference (in dB) between the network MSD of ATC-ZA and 
standard diffusion versus 7, for different sparsity degrees of w°. We consider the same settings of the 
previous simulation and the results are averaged over 100 independent experiments and over 100 samples 
after convergence. As we can see from Fig. [3] (left), reducing the sparsity of w°, the interval of 7 values 
that yields a gain for ATC-ZA with respect to standard diffusion becomes smaller, until it reduces to zero 
when the system is not sparse enough. Different update functions may affect differently the steady-state 
performance of the ATC-SD algorithm. Thus, in Fig. |3](right), we repeat the same experiment considering 
the ATC-RZA algorithm. As we can see, thanks to the reweighted regularization in (|28l l, ATC-RZA gives 
better performance than ZA-ATC and yields a performance loss with respect to standard diffusion, for 
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Fig. 4: (Left) Transient network MSD for the diffusion techniques ATC JH, ZA-ATC described by (ED 
and (|25]>, RZA- ATC described by (ED and (EU>, and the sparse diffusion algorithms from |@4]. (Right) 
Transient network MSD for the diffusion techniques ATC [8|, ZA-ATC described by (|2TT i and (|25V 
RZA- ATC described by (|2TT > and (l28l ). and the projection based distributed learning technique from [43 1 . 



any 7, only when the vector w° is completely non-sparse. 

Finally, we compare our proposed sparse diffusion schemes with the sparsity promoting adaptive 
algorithms for distributed learning recently proposed in [43] and in Pffll . At the best of our knowledge, 
the works in ||43l and [44] are the only two present in the literature that exploit sparsity processing data 
both in an adaptive and distributed fashion. In Fig. |4] (left), we compare the steady-state performance, 
averaged over 100 independent simulations, of five adaptive filters: ATC diffusion LMS [8], ZA-ATC 
diffusion described by (ED and <E2]>, RZA- ATC diffusion described by (ED and (EU>, the ATC h-LMS 
and the ATC l\ -RWLMS algorithms from P4l . We consider a vector parameter w° with only 5 elements 
set equal to one, which have been randomly chosen, leading to a sparsity ratio of 5/50. The sparsity 
parameters are set to ^ Z A = 10~ 3 for ZA-ATC, jrza = 0.7 x 10~ 3 for RZA-ATC, and e = 0.1, 
for both our methods and the algorithms from Ifl4"l . The other settings are the same of the previous 
simulation, except that in this simulation the combination coefficients in (|2T1) are chosen as q t fe = 1/|A4| 
for all I, thus leading to ZA-ATC and RZA-ATC diffusion algorithms with measurement exchange. As 
we can notice from Fig. [4] (left), the proposed methods outperform the algorithms from [44 1 in terms 
of steady-state MSD. In Fig. [4] (right), we compare the transient network MSD of four adaptive filters: 
ATC diffusion LMS H, ZA-ATC diffusion described by (ED and ([25]), RZA-ATC diffusion described 
by (ED and (l28l) . and the projection based sparse learning from [43]. The settings of the ZA-ATC and 
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RZA-ATC diffusion algorithms are the same of the previous simulation, whereas the parameters of the 
algorithm from P31 are chosen in order to have similar steady-state MSD with respect to the RZA-ATC 
diffusion method. Using the same notation adopted in |43l , the parameters of the projection based filter 
are: e = 1.3 x max^cr^); p n = 0.06 x M. n \ the radius of the weighted l\ ball is equal to ||u>°||o = 5 
(i.e., the correct sparsity level); e n = 0.02; a = 0.99 for i < 300 and a = 0.8 for i > 300; the 
number of hyperslabs used per time update equals to q = 10. From Fig. @] (right), it is possible to notice 
how the projected based method has a larger convergence rate with respect to the RZA ATC diffusion 
method. This positive feature is paid in terms of computational complexity. Indeed, while our methods 
have an LMS type complexity 0(3M), the projection-based method from P31 has a complexity equal 
to 0(M(3 + q + log M)), due to the presence of q projections onto the hyperslabs and 1 projection on 
the weighted l\ ball per iteration. 

Numerical Example 2 - Adaptation of the Regularization Parameter: In this example, we consider the 
same network shown in Fig. [T] and the same setting of the previous simulation for the regression data and 
additive noise. The first example aims to show the tracking and steady-state performance of the ATC-SD 
algorithm with adaptive regularization. In Fig. [5] (left), we report the learning curves in terms of network 
MSD for 3 different adaptive filters: ATC diffusion LMS flU, ZA-ATC diffusion described by (Efl and 
d25l )) and RZA-ATC diffusion described by (|2TT > and d28l , when the regularization parameter 7, is chosen 
locally at each node according to the adaptive rule (l80l . The simulations use a value of p, = 0.1 and 
the results are averaged over 100 independent experiments. The approximation parameter for RZA-ATC 
diffusion in d28l is chosen equal to e = 0.1. Initially, only one of the 50 elements of w° is set equal to 
one while the others are equal to zero, making the system very sparse. After 1000 iterations, 5 elements 
are randomly selected and set equal to 1, making the system have a sparsity ratio of 5/50. After 2000 
iterations, all the elements are set equal to 1, leaving a completely non-sparse system. The upper bound i] 
in (I77T ). used to evaluate the sparsity parameter in (l80l . is set to -q = ||to ||i and varies in time according to 
the different choices of w°. As we can see from Fig. [5] (left), when the system is very sparse both ZA-ATC 
and RZA-ATC yield better steady-state performance than standard diffusion. The RZA-ATC outperforms 
ZA-ATC thanks to the reweighted regularization. When the vector w° is less sparse, the performance 
of ZA-ATC deteriorates, getting closer to standard diffusion, while RZA-ATC still guarantees a large 
gain. When the system is completely non-sparse, the three filters have the same performance. To see the 
effect of different sparsity ratios of the vector w° on the choice of the regularization, in Fig. [5] (right) we 
show the average behavior of the parameter 7? evaluated according to d80l , for ZA-ATC diffusion and 
RZA-ATC diffusion, averaged across nodes over 100 independent realizations. As we can see, the system 
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Fig. 5: (Left) Transient network MSD for the the diffusion techniques ATC GO, ZA-ATC described by (l2~T1 ) 
and (1251 ), RZA-ATC described by (|2TT ) and d28l ) with adaptive selection of the regularization parameter 
7i. (Right) Temporal behavior of the regularization parameter 7, evaluated through the adaptive relation 
(USB for ZA-ATC diffusion (solid) and RZA-ATC diffusion (dashed). 



reacts to different sparsity ratios of the vector w°, adjusting accordingly the regularization parameter 7? 
in order to improve the performance of the ATC-SD strategy with respect to the unregularized algorithm. 
From Fig. [5] (right), it is interesting to note how the regularization parameter converges close to the 
minimum of the Differential MSD plotted in Fig. [3] for both ZA-ATC and RZA-ATC. In particular, 7? is 
forced to zero when the vector w° is totally non-sparse, leading to the same performance of the standard 
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Fig. 6: Sensitivity of ZA-ATC diffusion and RZA-ATC diffusion to misspecifications of the trigger 
parameter rj. 
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diffusion algorithm. 

Since the adaptive update of the sparsity parameter 7° in (l80l depends on the selection of the trigger 
77, which depends on some available prior knowledge on the sparsity level of w°, it is important to 
check the sensitivity of the ATC-SD algorithm to misspecified values of 77. Thus, in Fig. |6l we report 
the average behavior of the MSD, for ZA-ATC diffusion and RZA-ATC diffusion, versus a percentual 
error on the specification of the true trigger value 77. The settings are the same of the previous simulation 
and the results are averaged over 100 independent experiments and over 100 samples after convergence. 
We consider a vector parameter w° with only 5 elements set equal to one, which have been randomly 
chosen, leading to a sparsity ratio of 5/50. In this case, the true value for the trigger parameter 77 would 
be equal to = 5. The regularization parameter 7$ is chosen locally at each node according to 

the adaptive rule (|80l ). As we can notice from Fig. [6] the ZA-ATC diffusion algorithm is very sensitive 
to misspecified values of 77, especially in the case of under-estimation of the trigger parameter. Indeed, 
by under-estimating the value of 77, the system would try to increase the sparsity parameter 74, in order 
to make the solution more sparse. Thus, as we notice from Fig. [6j being the true vector w° not sparse 
enough with respect to the selection of the trigger 77, the system determines an increment of the bias that 
strongly affects the performance. On the contrary, from Fig. [6l we notice how the RZA-ATC diffusion 
algorithm is robust to errors in the selection of the trigger parameter 77. This benefit is again due to 
regularization, whose presence reduces the magnitude of the bias, improving the estimation capabilities 
of the algorithm and relaxing the choice of the system parameters. 

V. Conclusion 

In this paper we proposed a class of diffusion LMS strategies, regularized by convex sparsifying 
penalties, for distributed estimation over adaptive networks. Two different penalty functions have been 
employed: the l\-novm, which uniformly attracts to zero all the vector elements, and a reweighted function, 
which better approximates the i^-novm, selectively shrinking only the elements with small magnitude. 
Convergence and mean-square analysis of the sparse adaptive diffusion filter show under what conditions 
we have dominance of the proposed method with respect to its unregularized counterpart in terms of 
steady-state performance. Further analysis leads to a procedure to update the regularization parameter of 
the algorithm, in order to ensure dominance of the sparse diffusion filter with respect to its unregularized 
version. In this way, the network can adjust in real-time the system parameters to improve the estimation 
performance, according to the sparsity of the underlying vector. Several numerical results show the 
potential benefits of using such strategies. 
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Appendix A 
Proof of Theorem 1 

Letting B = A T (I - MV) and 6* = jA T M ■ Ed/(wj_i), recursion (|39) gives 

i-l 

Eibi = ffEtbo + J] S"6i_ n (82) 

n=0 

where E«)o is the initial condition. As long as we can show that both terms on the right hand side of (l82l 
converge as i goes to infinity, then we would be able to conclude the convergence of EtWj. To proceed, 
we call upon results from ifTOl , ifTTI . ||91 . Let z = col{zi, z^, ... , £jv} denote a vector that is obtained by 
stacking N subvectors of size Mxl each (as is the case with ibi). The block maximum norm of z is 
defined as 

Iklkoo = max ||2: fc ||, (83) 

l<fc<Af 

where || • || denotes the Euclidean norm of its vector argument. Likewise, the induced block maximum 
norm of a block matrix X with M X M block entries is defined as: 

1 1 1 1 II ^ 1 1 \) OO 

|| oc = max ' . (84) 

z ¥=0 \\Z\\b,oo 

It is easy to check that the first term on the RHS of (|82l converges to zero as i — > oo. Indeed, note that 

||B*Eu;o||oo < \\B\\i iO0 ■ ||Eu>o||6,oo -> (85) 

if we can ensure that ||<B||& j00 < 1. This condition is actually satisfied by (l40l) . To see this, we invoke the 
triangle inequality of norms to note that 

ll^lkoo = ||^ T (/ - MV)\\ hoo < \\A T \\ b ^ ■ \\I - MV\\ b ^ = \\I - MV\\ byOQ (86) 

since ||^ T || 6oo = 1 in view of the fact that A is a left-stochastic matrix [10]. Therefore, to satisfy 
ll^ll&.oo < 1, it suffices to require 

\\I-MV\\ h>00 < 1. (87) 

Now, we recall a result from |flT). ll48l on the block maximum norm of a block diagonal and Hermitian 
matrix X with M x M blocks {Xk}, which states that 

||-Y|| 6 oo = max p(X k ) (88) 

Thus, since A4 is diagonal, condition (l87l) will hold if the matrix / — MT> is stable. Using (l38l ). we can 
easily verify that this condition is satisfied for any step-sizes satisfying d40l) . as claimed before. Therefore, 
when the step-sizes satisfy condition (l40l . the first term on the RHS of ( f82l) will converge to zero. We 
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will show next that condition (1401 ) also implies that the second term on the RHS of (1821) asymptotically 
converges to a finite value, thus leading to the overall convergence of the recursion (l82l . 

One effective tool to prove convergence of a series is the comparison test BT1 p. 14]: a series is 
absolutely convergent if each term of the series can be bounded by a term of an absolutely convergent 
series. Thus, denoting by [x]k the k-th entry of a vector x, it suffices to show that the series 

oo 

£ P" 6 *-*]* ( 89 ) 

n=0 

converges for each k = 1, . . . , NM. Now, each term of the series in ((891 can be bounded as: 

[B n bi- n ] k < \[B n bi- n ] k \ < ||B n 6 i _ n || 6l00 < ||B||? >0O • ||&i-J 6> oo < $ n ■ Vax (90) 
where 5 = \\B\\j, j00 and 

"max — max ||7^ r X-Ea/K_i)|| fei00 (91) 

3 

The second inequality in d90l holds because the block maximum norm of a vector is greater than or 
equal to the largest absolute value of its entries. The scalar & max is finite for the following reason. First, 
note that the subgradient vector df{wi-\) has bounded entries. In particular, <3/ max < \[M for the ZA 
update in (l2"5i and <9/ max < VM je for the RZA update in (l2"8l We further note that ||*4 T || b ^ = 1 and 
koo = Mmax- It follows that 

frmax < max7 • ||.A T || 6)00 • || A4[| fo oo • ||Eo>/(i«i_i)|| 6oo = 7 • fi max ■ 9/ max (92) 



Now, if condition (1401 is satisfied, then 5 = ||i3||&.oo < 1 an d 



oc 



£ 5n ■ b ™ = fT7 (93) 

n=0 

which means that the series (|93l ) and, consequently, the series d89l ), are absolutely convergent. In summary, 
since both first and second term on the RHS of d82l asymptotically converge to finite values, we conclude 
that Kibi will converge to a steady-state value. Now, taking the limit of 09l ) as i — > oo, it is easy to 
derive a closed form expression for the bias: 

bias = lim Eibi = 7 • \l — A T (I — MV)} ~ l A T M lim Edfiw^) (94) 

i— >oo i— >oo 

Moreover, exploiting (|90l , (l92l and (|93l ), we further note that 



|bias||b oo = lim ||Ett;i||& l00 = lim 

i— >oo i—too 



i-1 



n=0 



i-1 



< .limEll^-nlUoo (95) 



, ,. V^noiin 111 11 ^ &max ^ 7 ' A'-max " 9f max 

* }^\\ B koo\\bi-n\\ b ^<— § < — § (96) 

n=0 



This completes the proof of Theorem 1 . 
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Appendix B 
Proof of Theorem 2 



From (@7)-(@9) we have 

^(7)=7 2 E||5/K_ 1 )|| 



MAT,A T M 



2 1 mf{w % _ l ) T MAZA T (I - MV) ibi-i 



(97) 



Since, as noted in Appendix A, df(-) is a bounded function for all i, the term fi^i in d48l > can be upper 
bounded by a positive constant term p\ for all i. The term as^ in d49l can be written as Ec^_ 1 tOj_i 
where the vector 



Ci_i = — 2(1 — MP) J ^S^ T 7W0/(wi_i 



is again bounded for all i. Thus, we have 

M 



m=l 



Cm,!-l^m,i-l 



M 



m=l 



< c max l |EtZ)i_i| 



(98) 



(99) 



where c max = maxj |c m j_i|. As shown in Appendix A, the evolution of E«>j_i is given by (1821 . which, 
for any finite initialization vector wq, converges as i — > oo and cannot diverge for all i, if the step-sizes 
are chosen to satisfy (l4Qb . Consequently, |EtOj_x| can be upper bounded by some positive constant vector 
P2 for all i. Thus, letting r = vec{A T M.Q T M.A), expression (153T ) can be upper bounded as 



E||t&i||* < E||t6i_i||3r <r + r i <7+p 3 



(100) 



where p^ = j 2 pi + jc max l T p2 > 0. The positive constant p% can be related to the quantity r T a through 
some constant v 6 M + , say, p% = vr T a. Relation (1 100b is an inequality, which can be used to prove 
convergence of the sequence E[|ti>i[|^ to a bounded region instead of a fixed point. Alternatively, we 
convert (1 1001 ) into an equality recursion as follows: 



E\\wi\\l = fliEHt&i-ill^ + 0i(l + v)r T a 



(101) 



for some coefficient Q\ G [0, 1] that depends on both E||tWj||^ and EUtUj-iUj-g.. Recursion (1 1 1 b leads to: 

i-l 



Ellw,- 



\[e l n\w4% a + {l+v)r T Y, n 6n Pa (l02) 

i=l J 1=0 ln=i-l 

where IE 1 1 -ai? o 1 1 2 is the initial condition. We first note that if F is stable, F 2 — > as % — > oo. In this way, 
the first term on the RHS of (11021 ) vanishes asymptotically. Now, proceeding as in Appendix A, we can 
use the comparison test ||5T1 p. 14] to prove that, if J 7 is a stable matrix, the second term on the RHS 
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of (11021 ) is an absolutely convergent series. Thus, denoting again by [x]k the k-th entiy of a vector x, it 
suffices to show the convergence of the series: 



oo 
1=0 



i)Fa 



(103) 



with = Ylli = i-i Gn, for k = 1, ... , NM. Each term of the series in (1 1 Q3t can be bounded as: 



< 



< II^CrlL oo < ||J rZ ||6,oo||o'||6,oo 



(104) 



where the second inequality in (|104| l holds because the coefficients G [0, 1] for all i, whereas the 
third inequality in (11041 ) holds because the block maximum norm of a vector is greater equal than the 
largest absolute value of its entries. A known result in matrix theory ||49l p. 30] states that for every 
square stable matrix F, and every e > 0, there exists a submultiplicative matrix norm || • \\ p such that 

\\F\\ p = p{F) + e (105) 

Since F is stable, p(F) < 1, we can choose e > such that p(F) + e = £ < 1. Now, since in a finite 
dimensional space all norms are equivalent [50], we have || • \\b,oo < C ' II ' llp> f° r some positive constant 
£. Thus, we have 



ii^ii6,oo<c-n^iip<c-imiJ> = c-^ 

and, substituting (11061 ) into (11041 ). we get 



6,oo 



kll^ < C • IN 



6,oo 



EC 



C ■ \W\\b,c 

1-f 



(106) 



(107) 



z=o /=0 
which means that the series (11071 ) and, consequently, the series (11031 ). are absolutely convergent. In 

summary, since both the first and second terms on the RHS of (11021 ) asymptotically converge to finite 

values, we conclude that E||t«j||^ will converge to a steady-state value, thus completing our proof. 

Appendix C 
Existence of as,oo 

Let us consider the bounded random vector Cj in (|98l ), which is independent of the noise sequence 
v k (i) for all k,i. Letting B = A T (I - MV) and b t = ^A T Mdf{w i _ 1 ), from d37]), we get 



i-l 



a Sj00 = lim Ecfwi = lim Ecf&ibo + lim V Ec[B n bi 



(108) 



71=0 



where wq is the initial condition. Following the same steps as in Appendix A, if the step-sizes satisfy 
condition d40l , the first term on the RHS of (11081 ) will converge to zero. Furthermore, since the vector 
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sequence c, is bounded, similarly to what we have done in (|89ll-(l93T), we can again use the comparison 
test [51, p. 14] to prove that the second term on the RHS of (11081 ) asymptotically converges to a finite 
value, thus leading to the existence of the limit in d 108b - 
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